97 research outputs found

    Homology Induction: the use of machine learning to improve sequence similarity searches

    BACKGROUND: The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify ~50% of homologies (with a false positive rate set at 1/1000). RESULTS: We present Homology Induction (HI), a new approach to inferring homology. HI uses machine learning to bootstrap from standard sequence similarity search methods. First a standard method is run; HI then learns rules that are true for sequences of high similarity to the target (assumed homologues) and not true for general sequences. These rules are then used to discriminate sequences in the twilight zone. To learn the rules, HI describes the sequences in a novel way based on a bioinformatic knowledge base and uses the machine learning method of inductive logic programming. To evaluate HI we used the PDB40D benchmark, which lists sequences of known homology but low sequence similarity. We compared the HI methodology with PSI-BLAST alone and found that HI performed significantly better. In addition, Receiver Operating Characteristic (ROC) curve analysis showed that these improvements were robust for all reasonable error costs. The predictive homology rules learnt by HI can be interpreted biologically to provide insight into conserved features of homologous protein families. CONCLUSIONS: HI is a new technique for the detection of remote protein homology – a central bioinformatic problem. HI with PSI-BLAST is shown to outperform PSI-BLAST for all error costs. It is expected that similar improvements would be obtained using HI with any sequence similarity method.

    Pairwise learning to rank by neural networks revisited: reconstruction, theoretical analysis and practical performance

    We present a pairwise learning to rank approach based on a neural net, called DirectRanker, that generalizes the RankNet architecture. We show mathematically that our model is reflexive, antisymmetric, and transitive, allowing for simplified training and improved performance. Experimental results on the LETOR MSLR-WEB10K, MQ2007 and MQ2008 datasets show that our model outperforms numerous state-of-the-art methods, while being inherently simpler in structure and using a pairwise approach only.
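The reflexivity and antisymmetry properties named in this abstract can be illustrated with a minimal sketch. It assumes a pairwise score of the form r(x1, x2) = tanh(w · (f(x1) − f(x2))), where f is a shared feature function and the output has no bias term. That form, the helper names and the random weights are illustrative assumptions, not the authors' trained model or code.

```python
import math
import random


def make_feature_net(n_inputs, n_hidden, seed=0):
    """Build a tiny one-layer feature function f(x) with random weights (hypothetical)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]

    def features(x):
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)
        ]

    return features


def make_pairwise_ranker(n_inputs, n_hidden=4, seed=0):
    """Return r(x1, x2) = tanh(w . (f(x1) - f(x2))) with a shared f and no output bias."""
    features = make_feature_net(n_inputs, n_hidden, seed)
    rng = random.Random(seed + 1)
    out_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]

    def rank(x1, x2):
        f1, f2 = features(x1), features(x2)
        # The score depends only on the feature difference, and tanh is odd,
        # so swapping the two inputs flips the sign of the output.
        return math.tanh(sum(w * (a - b) for w, a, b in zip(out_weights, f1, f2)))

    return rank


rank = make_pairwise_ranker(n_inputs=3)
doc_a = [0.2, 1.0, -0.5]
doc_b = [0.9, -0.3, 0.4]

print(rank(doc_a, doc_a))                      # reflexive: a document ties with itself
print(rank(doc_a, doc_b), rank(doc_b, doc_a))  # antisymmetric: equal magnitude, opposite sign
```

Because the sign of the score is the sign of w · f(x1) − w · f(x2), the induced ordering is also transitive under this assumed form.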

    Clinicians' Views of Patient-initiated Follow-up in Head and Neck Cancer: a Qualitative Study to Inform the PETNECK2 Trial

    Aims: Current follow-up for head and neck cancer (HNC) is ineffective, expensive and fails to address patients' needs. The PETNECK2 trial will compare a new model of patient-initiated follow-up (PIFU) with routine scheduled follow-up. This article reports UK clinicians' views about HNC follow-up and PIFU, to inform the trial design. Materials and methods: Online focus groups with surgeons (ear, nose and throat/maxillofacial), oncologists, clinical nurse specialists and allied health professionals. Clinicians were recruited from professional bodies, mailing lists and personal contacts. Focus groups explored views on current follow-up and acceptability of the proposed PIFU intervention and randomised controlled trial design (presented by the study co-chief investigator), preferences, margins of equipoise, potential organisational barriers and thoughts about the content and format of PIFU. Data were interpreted using inductive thematic analysis. Results: Eight focus groups with 34 clinicians were conducted. Clinicians highlighted already known limitations with HNC follow-up – lack of flexibility to address the wide-ranging needs of HNC patients, expense and lack of evidence – and agreed that follow-up needs to change. They were enthusiastic about the PETNECK2 trial to develop and evaluate PIFU but had concerns that PIFU may not suit disengaged patients and may aggravate patient anxiety/fear of recurrence and delay detection of recurrence. Anticipated issues with implementation included ensuring a reliable route back to clinic and workload burden on nurses and allied health professionals. Conclusions: Clinicians supported the evaluation of PIFU but voiced concerns about barriers to help-seeking. An emphasis on patient engagement, psychosocial issues, symptom reporting and reliable, quick routes back to clinic will be important. Certain patient groups may be less suited to PIFU, which will be evaluated in the trial. Early, meaningful, ongoing engagement with clinical teams and managers around the trial rationale and recruitment process will be important to discourage selective recruitment and address risk-averse behaviour and potential workload burden.
    Members of the PETNECK2 Research Team: A. Karwath; B. Main; C. Gaunt; C. Greaves; D. Moore; E. Watson; G. Gkoutos; G. Ozakinci; J. Wolstenholme; J. Dretzke; J. Brett; J. Duda; L. Matheson; L.-R. Cherrill; M. Calvert; P. Kiely; P. Gaunt; S. Chernbumroong; S. Mittal; S. Thomas; S. Winter; W. Won

    Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings

    Recent developments in machine learning have led to a rise in the number of methods for extracting features from structured data. The features are represented as vectors and may encode some semantic aspects of the data. They can be used in machine learning models for different tasks or to compute similarities between the entities of the data. SPARQL is a query language for structured data originally developed for querying Resource Description Framework (RDF) data. It has been in use for over a decade as a standardized NoSQL query language. Many different tools have been developed to enable data sharing with SPARQL. For example, SPARQL endpoints make your data interoperable and available to the world. SPARQL queries can be executed across multiple endpoints. We have developed Vec2SPARQL, a general framework for integrating structured data and their vector space representations. Vec2SPARQL allows jointly querying vector functions, such as computing similarities (cosine, correlations) or classifications with machine learning models, within a single SPARQL query. We demonstrate applications of our approach for biomedical and clinical use cases. Our source code is freely available at https://github.com/bio-ontology-research-group/vec2sparql and we make a Vec2SPARQL endpoint available at http://sparql.bio2vec.net/
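The cosine similarity between entity vectors mentioned in this abstract can be sketched in plain Python. This is only an illustration of the vector function itself: it does not use or reproduce the Vec2SPARQL query interface, and the entity names and three-dimensional vectors below are made-up placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, k=2):
    """Rank all other entities by cosine similarity to the query entity."""
    scores = [
        (name, cosine_similarity(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical entity embeddings, for illustration only.
embeddings = {
    "entity_a": [1.0, 0.0, 1.0],
    "entity_b": [1.0, 0.1, 0.9],
    "entity_c": [-1.0, 0.5, 0.0],
}

for name, score in most_similar("entity_a", embeddings):
    print(name, round(score, 3))
```

In a framework of the kind the abstract describes, a function like this would be evaluated inside the query rather than in client code, so that graph patterns and similarity filters are combined in one request.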

    Desiderata for the development of next-generation electronic health record phenotype libraries

    Background: High-quality phenotype definitions are desirable to enable the extraction of patient cohorts from large electronic health record repositories and are characterized by properties such as portability, reproducibility, and validity. Phenotype libraries, where definitions are stored, have the potential to contribute significantly to the quality of the definitions they host. In this work, we present a set of desiderata for the design of a next-generation phenotype library that is able to ensure the quality of hosted definitions by combining the functionality currently offered by disparate tooling. Methods: A group of researchers examined work to date on phenotype models, implementation, and validation, as well as contemporary phenotype libraries developed as a part of their own phenomics communities. Existing phenotype frameworks were also examined. This work was translated and refined by all the authors into a set of best practices. Results: We present 14 library desiderata that promote high-quality phenotype definitions, in the areas of modelling, logging, validation, and sharing and warehousing. Conclusions: There are a number of choices to be made when constructing phenotype libraries. Our considerations distil the best practices in the field and include pointers towards their further development to support portable, reproducible, and clinically valid phenotype design. The provision of high-quality phenotype definitions enables electronic health record data to be more effectively used in medical domains.

    Ontology-based prediction of cancer driver genes

    Identifying and distinguishing cancer driver genes among thousands of candidate mutations remains a major challenge. Accurate identification of driver genes and driver mutations is critical for advancing cancer research and personalizing treatment based on accurate stratification of patients. Due to inter-tumor genetic heterogeneity, many driver mutations within a gene occur at low frequencies, which makes it challenging to distinguish them from non-driver mutations. We have developed a novel method for identifying cancer driver genes. Our approach utilizes multiple complementary types of information, specifically cellular phenotypes, cellular locations, functions, and whole body physiological phenotypes, as features. We demonstrate that our method can accurately identify known cancer driver genes and distinguish between their roles in different types of cancer. In addition to confirming known driver genes, we identify several novel candidate driver genes. We demonstrate the utility of our method by validating its predictions in nasopharyngeal cancer and colorectal cancer using whole exome and whole genome sequencing.